R is a great data analysis and stats language and has lots of practical applications in many businesses. Using it can result in a fantastic return on investment as R is free to use, but you’ll only get that ROI when you’re using it in a robust infrastructure and utilising sensible development practices. This training day takes you through the basics of R development, showing you the best practices along the way, then shows you how to set up the Linux infrastructure needed to do team development and to deliver reporting across the company.
R is an integrated suite of software facilities for data manipulation, calculation and graphical display
git config --global user.name "Your name"
git config --global user.email "email@addre.ss"
Install Rtraining by executing
library(devtools)
# Case-sensitive!
install_github("stephlocke/Rtraining")

| Action | Operator | Example |
|---|---|---|
| Subtract | - | 5 - 4 = 1 |
| Add | + | 5 + 4 = 9 |
| Multiply | * | 5 * 4 = 20 |
| Divide | / | 5 / 4 = 1.25 |
| Raise to the power | ^ | 5 ^ 4 = 625 |
| Modulus | %% | 10 %% 4 = 2 |
| Integer division | %/% | 9 %/% 4 = 2 |
| Basic sequence | : | sum(1:3) = 6 |

| Action | Operator | Example |
|---|---|---|
| Less than | < | 5 < 5 = FALSE |
| Less than or equal to | <= | 5 <= 5 = TRUE |
| Greater than | > | 5 > 5 = FALSE |
| Greater than or equal to | >= | 5 >= 5 = TRUE |
| Exactly equal | == | (0.5 - 0.3) == (0.3 - 0.1) is FALSE, 2 == 2 is TRUE |
| Not equal | != | (0.5 - 0.3) != (0.3 - 0.1) is TRUE, 2 != 2 is FALSE |
| Equal | all.equal() | all.equal(0.5 - 0.3,0.3 - 0.1) is TRUE |
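
The floating-point rows above are easy to trip over, so here is a minimal sketch in base R. It assumes nothing beyond the functions already named in the table; the variable names are only illustrative. Wrapping all.equal() in isTRUE() is a common pattern because all.equal() returns a text description, not FALSE, when the values differ.

```r
# Two differences that are mathematically equal but not bit-identical
x <- 0.5 - 0.3
y <- 0.3 - 0.1

# Exact comparison of doubles
exact <- x == y

# Comparison within a small numerical tolerance
approx <- isTRUE(all.equal(x, y))

# all.equal() describes the difference instead of returning FALSE,
# so isTRUE() turns the result into a plain TRUE/FALSE
different <- isTRUE(all.equal(1, 1.1))

print(c(exact = exact, approx = approx, different = different))
```
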
| States | Representation |
|---|---|
| True | TRUE , 1 |
| False | FALSE , 0 |
| Empty | NULL |
| Unknown | NA |
| Not a number e.g. 0/0 | NaN |
| Infinite e.g. 1/0 | Inf |

| Action | Operator | Example |
|---|---|---|
| Not | ! | !TRUE is FALSE |
| And | & | TRUE & FALSE is FALSE, c(TRUE,TRUE) & c(FALSE,TRUE) is FALSE, TRUE |
| Or | \| | TRUE \| FALSE is TRUE, c(TRUE,FALSE) \| c(FALSE,FALSE) is TRUE, FALSE |
| Xor | xor() | xor(TRUE,FALSE) is TRUE |
| Short-circuit And (single values only) | && | TRUE && FALSE is FALSE |
| Short-circuit Or (single values only) | \|\| | TRUE \|\| FALSE is TRUE |
| In | %in% | "Red" %in% c("Blue","Red") = TRUE |
| Not in | !( x %in% y) or Hmisc::%nin% | "Red" %nin% c("Blue","Red") = FALSE |
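
As a rough illustration of how the logical operators above combine in practice, the following base R sketch filters a small vector. The vectors `colours` and `wanted` are made up for the example.

```r
colours <- c("Blue", "Red", "Green")
wanted  <- c("Red", "Amber")

# & and | work element by element and return a vector
is_red_or_blue <- colours == "Red" | colours == "Blue"

# %in% tests membership of each element of the left-hand side
found <- wanted %in% colours

# Negate %in% with ! to get "not in"
not_found <- wanted[!(wanted %in% colours)]

# && and || expect single values, which suits if() conditions
has_any <- length(colours) > 0 && any(found)

print(is_red_or_blue)
print(not_found)
```
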
| Type | Implementation | Example |
|---|---|---|
| If | if(condition) {do something} | if(TRUE) { 2 } returns 2 |
| If else | if(condition) {do something} else {do something different} or ifelse(condition, do something, do something else) | if(FALSE) { 2 } else { 3 } returns 3, ifelse(FALSE, 2, 3) returns 3 |
| For loop | for(i in seq) {do something} or foreach::foreach(i=1:3) %do% {something} | foreach(i=1:3) %do% {TRUE} returns TRUE, TRUE, TRUE |
| While loop | while(condition) {do something} | a<-0 ; while(a<3){a<-a+1} ; a returns 3 |
| Switch | switch(value, …) | switch(2, "a", "b") returns b |
| Case | memisc::cases(…) | cases("pi<3"=pi<3, "pi=3"=pi==3, "pi>3"=pi>3) returns pi>3 |
NB: If you find yourself using a loop, there’s probably a better, faster solution
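
To make the note above concrete, here is a small base R sketch comparing a loop with vectorised alternatives. The task (a sum of squares) and the variable names are chosen purely for illustration.

```r
values <- 1:10

# Loop version: accumulate the total one element at a time
total_loop <- 0
for (v in values) {
  total_loop <- total_loop + v^2
}

# Vectorised version: square the whole vector, then sum it
total_vec <- sum(values^2)

# sapply() applies a function to each element without an explicit loop
squares <- sapply(values, function(v) v^2)

print(c(loop = total_loop, vectorised = total_vec))
```

Both approaches give the same total; the vectorised form is shorter and pushes the iteration down into compiled code.
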
| Action | Operator | Example |
|---|---|---|
| Create / update a variable | <- | a <- 10 |
NB: There are others you could use, but this is the best practice
| Action | Operator | Example |
|---|---|---|
| Use public function from package | :: | memisc::cases() |
| Use private function from package | ::: | optiRum:::pounds_format() |
| Get a component e.g a data.frame column | $ | iris$Sepal.Length |
| Extract a property from a class | @ | Won’t be used in this course |
| Refer to positions in a data.frame or vector | [ ] | iris[5:10,1] |
| Refer to item in a list | [[ ]] | list(iris=iris,mtcars=mtcars)[["iris"]] |
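
The accessors in the table above can be tried out with the built-in iris and mtcars datasets. This is only a sketch; the object names are arbitrary, and stats::median is used simply because the stats package ships with R.

```r
# $ extracts a named component, here a data.frame column
lengths_col <- iris$Sepal.Length

# [ ] subsets by position: rows 5 to 10 of the first column
subset_vals <- iris[5:10, 1]

# [[ ]] pulls a single item out of a list by name
datasets <- list(iris = iris, mtcars = mtcars)
iris_copy <- datasets[["iris"]]

# :: calls an exported function without attaching its package
med <- stats::median(lengths_col)

print(subset_vals)
```
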
| Data type | Example |
|---|---|
| Integer | 1L |
| Logical | TRUE |
| Numeric | 1.1 |
| String / character | "Red" |
| Factor (enumerated string) | "Amber" or 2 in c("Red","Amber","Green") |
| Complex | 1i |
| Date | as.Date("2015-04-24") |

| Data type | Info | Construction example(s) |
|---|---|---|
| Vector | A 1D set of values of the same data type | c(1,"a") , 1:3 , LETTERS |
| Matrix | A 2D set of values of the same data type | matrix(LETTERS,nrow=13, ncol=2) , rbind(1:5,2:6) |
| Array | An nD set of values of the same data type | array(LETTERS, c(13,2)) |
| Data.frame | A 2D set of values of different data types | data.frame(a=1:26, b=LETTERS) |
| List | A collection of objects of various data types | list(vector=c(1,"a"), df=data.frame(a=1:6)) |
| Classes | A class is like a formalised list and can also contain functions i.e. methods | Won’t be covered in this class |
NB: Most of my work uses vectors, data.tables (a souped up version of data.frames), and lists
| Function | Use |
|---|---|
| is.[data type] | Whether a vector is of a particular type |
| as.[data type] | Attempts to coerce a vector to a data type |
| str | Structure of an object including class/data type, dimensions |
| class | The class(es)/data type(s) an object belongs to |
| summary | Summarises an object |
| dput | Get R code that recreates an object |
| unlist | Simplify a list to a vector |
| dim | Dimensions of a data type |
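
A short base R sketch of the inspection and coercion functions listed above. The input vector is invented for the example, and suppressWarnings() is used only to keep the coercion warning out of the output.

```r
x <- c("1", "2", "three")

# is.* checks the type, as.* attempts a conversion
was_character <- is.character(x)
nums <- suppressWarnings(as.numeric(x))  # "three" cannot be converted, so it becomes NA

# class() and str() describe what an object is
print(class(nums))
str(nums)

# unlist() flattens a list into a (named) vector
nested <- list(a = 1:2, b = 3)
flat <- unlist(nested)

# dim() gives rows and columns for 2D objects
dims <- dim(iris)

# dput() prints R code that recreates the object
dput(1:3)
```
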
| Format | Functions |
|---|---|
| CSV | read.csv , data.table::fread , readr::read_csv |
| Excel | readxl::read_excel |
| Database | RODBC::sqlQuery , DBI::dbGetQuery |
| SPSS / SAS / Stata | haven::read_[prog] |
| Hadoop | rHadoopClient::read.hdfs |
| NoSQL | rmongodb::mongo.find , RNeo4j::getNodes |

| Format | Functions |
|---|---|
| CSV | write.csv |
| Excel | . |
| Database | RODBC::sqlSave , DBI::dbWriteTable |
| SPSS / SAS / Stata | . |
| Hadoop | . |
| NoSQL | . |
DT[i, j, by]
DT[WHERE | JOIN | ORDER, SELECT | UPDATE, GROUP]
A data.table acts like an in-memory RDBMS:
There are some differences that need to be mentioned:
| Task | Generic syntax | Example(s)* |
|---|---|---|
| CREATE | data.table(…) | data.table(a=1:3 , b=LETTERS[1:3]) data.table(iris) |
| PRIMARY KEY | data.table(…,key) setkey() | data.table(a=1:3 , b=LETTERS[1:3], key="b") setkey(data.table(iris),Species) |
| SELECT basic | DT[ , .( cols )] | irisDT[ , .(Species, Sepal.Length)] |
| SELECT alias | DT[ , .( a=col )] | irisDT[ , .(Species, Length=Sepal.Length)] |
| SELECT COUNT | DT[ , .N] | irisDT[ ,.N] |
| SELECT COUNT DISTINCT | DT[ , uniqueN(cols)] | irisDT[ ,uniqueN(.SD)] |
| SELECT aggregation | DT[ , .( sum(col) , .N )] | irisDT[ , .(Count=.N, Length=mean(Sepal.Length))] |
| SELECT dynamically i.e. by reference | DT[ , colnames , with=FALSE] | cols<-colnames(irisDT); irisDT[ , cols, with=FALSE] |
| WHERE exact on primary key | DT[value] DT[value, ] | irisDT["setosa"] irisDT["setosa", .(Count=.N)] |
| WHERE | DT[condition] DT[condition, j, by] | irisDT[Species=="setosa"] irisDT[Species=="setosa", .(Count=.N)] |
| WHERE BETWEEN | DT[between(col, min, max)] DT[ col %between% c(min,max) ] | irisDT[between(Sepal.Length, 1, 5)] irisDT[Sepal.Length %between% c(1,5)] |
| WHERE LIKE | DT[like(col,pattern)] DT[ col %like% pattern ] | irisDT[like(Species,"set")] irisDT[Species %like% "set"] |
| ORDER asc. | DT[order(cols)] DT[order(cols), j, by] | irisDT[order(Species)] |
| ORDER desc. | DT[order(-cols)] DT[order(-cols), j, by] | irisDT[order(-Species)] |
| ORDER multiple | DT[order(cols)] DT[order(cols), j, by] | irisDT[order(-Species, Petal.Width)] |
| GROUP BY single | DT[i, j, by] | irisDT[ ,.N, by=Species] |
| GROUP BY multiple | DT[i, j, by] | irisDT[ ,.N, by=.(Species,Width=Petal.Width)] |
| TOP | head(DT, n) | head(irisDT) |
| HAVING | DT[i, j, by][condition] | irisDT[ , .(Count=.N), by=Species][Count>25] |
| Sub-queries | DT[…][…][…] | irisDT[ , .(Sepal.Length=mean(Sepal.Length)), by=Species][Sepal.Length>6, .(Species)] |
* Uses irisDT <- data.table(iris); the primary key examples additionally need setkey(irisDT, Species)
| Task | Generic syntax | Example(s)* |
|---|---|---|
| INSERT | DT <- rbindlist(list(DT, newDT)) | irisDT<-rbindlist( list(irisDT, irisDT[1]) ) |
| READ aka SELECT (see above) | DT[ , .( cols )] | irisDT[ , .(Species, Sepal.Length)] |
| UPDATE / ADD column | DT[ , a := b ] | irisDT[ , Sepal.Area := Sepal.Width * Sepal.Length] |
| UPDATE / ADD multiple columns | DT[ , `:=`(a = b, c = d) ] | irisDT[ , `:=`(CreatedDate = Sys.Date(), User = "Steph")] |
| UPDATE / ADD multiple columns by reference | DT[ , (newcols):=vals ] | irisDT[ , c("a","b"):=.(1,2)] |
| DELETE | DT <- DT[!condition] | irisDT <- irisDT[!(Species=="setosa" & Petal.Length>=1.5)] |
* Uses irisDT <- data.table(iris)
| Task | Generic syntax | Example(s)* |
|---|---|---|
| Structure | str(DT) | str(irisDT) |
| Column Names | colnames(DT) | colnames(irisDT) |
| Summary stats | summary(DT) | summary(irisDT) |
| Retrieve primary key info | key(DT) | key(irisDT) |
* Uses irisDT <- data.table(iris)
| Task | Generic syntax | Example(s)* |
|---|---|---|
| INNER JOIN | Y[X, nomatch=0L] | lookupDT[irisDT,nomatch=0] |
| LEFT JOIN | Y[X] | lookupDT[irisDT] |
| FULL JOIN | merge(X, Y, all=TRUE) | merge(irisDT, lookupDT, all=TRUE) |
| CROSS JOIN | optiRum::CJ.dt(X,Y) | CJ.dt(irisDT, lookupDT) |
| UNION ALL | rbindlist( list(X,Y), fill=TRUE ) | rbindlist( list(irisDT, lookupDT), fill=TRUE ) |
| UNION | unique( rbindlist( list(X,Y), fill=TRUE ) ) | unique( rbindlist( list(irisDT, lookupDT), fill=TRUE ) ) |
| JOIN and AGGREGATE | Y[X, cols, by] | lookupDT[irisDT,.(count=.N),by=Band] |
* Uses:
irisDT <- data.table(iris, key="Species")
lookupDT <- data.table(Species=c("setosa", "virginica", "Blah"), Band=c("A", "B", "A"), key="Species")
| Task | Generic syntax | Example(s)* |
|---|---|---|
| UPDATE / ADD column of summary stat | DT[ , a := b ] | irisDT[ , All.SL.Mean:=mean(Sepal.Length)] |
| UPDATE / ADD column by group | DT[ , a := b, by] | irisDT[ , Species.SL.Mean:=mean(Sepal.Length), by=Species] |
| TOP by group | DT[ , head(.SD), by] | irisDT[ , head(.SD,2) , by=Species] |
| Largest record | DT[ which.max(col) ] | irisDT[ which.max(Sepal.Length) ] |
| Largest record by group | DT[ , .SD[ which.max(col) ], by] | irisDT[ , .SD[ which.max(Sepal.Length) ], by=Species] |
| Cumulative total | DT[ , cumsum(col) ] | irisDT[ , cumsum(Sepal.Width)] |
| NEGATIVE SELECT | DT[ , .SD, .SDcols=-"colname"] | irisDT[ , .SD, .SDcols=-"Species"] |
| RANK | DT[ , frank(col) ] | irisDT[ , frank(Sepal.Length,ties.method="first")] |
| AGGREGATE multiple columns | DT[ , lapply(.SD, sum)] | irisDT[ , lapply(.SD,sum), .SDcols=-"Species"] |
| AGGREGATE multiple columns by group | DT[ , lapply(.SD, sum), by] | irisDT[ , lapply(.SD,sum), by=Species] |
| COUNT DISTINCT multiple columns by group | DT[ , lapply(.SD, uniqueN), by] | irisDT[ , lapply(.SD,uniqueN), by=Species] |
| COUNT NULL multiple columns by group | DT[ , lapply(.SD, function(x) sum(is.na(x))), by] | irisDT[ , lapply(.SD,function(x) sum(is.na(x))), by=Species] |
| PIVOT data - to single value column | melt(DT,…) | melt(irisDT) |
| PIVOT data - to aggregate | dcast(DT, a~b, function) | dcast(melt(irisDT), Species ~ variable, sum) |
* Uses irisDT <- data.table(iris)
| Task | Generic syntax | Example(s)* |
|---|---|---|
| GROUP BY each new incidence of group | DT[ , cols , by=.(col, rleid(col))] | irisDT[order(Sepal.Length), .N, by=.(Species, rleid(Species))] |
| Calculate using (previous/next/n) row | DT[ , col / shift( cols, n, fill, type)] | irisDT[ , prev.Sepal.Length:=shift(Sepal.Length), by=Species ] |
* Uses irisDT <- data.table(iris)
This intro covers the charting package ggplot2.
The “base” charting functionality will not be covered because it’s much more difficult to achieve good-looking results quickly and I don’t believe in that much effort for so little benefit!
ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
| Term | Explanation | Example(s) |
|---|---|---|
| plot | A plot using the grammar of graphics | ggplot() |
| aesthetics | attributes of the chart | colour, x, y |
| mapping | relating a column in your data to an aesthetic | aes(x=Sepal.Width, colour=Species) |
| statistical transformation | a translation of the raw data into a refined summary | stat_density() |
| geometry | the display of aesthetics | geom_line(), geom_bar() |
| scale | the range of values | axes, legends |
| coordinate system | how geometries get laid out | coord_flip() |
| facet | a means of subsetting the chart | facet_grid() |
| theme | display properties | theme_minimal() |
library(ggplot2)
p <- ggplot(data=iris)
p <- ggplot(data=iris, aes(x=Sepal.Width, y=Sepal.Length, colour=Species))
p <- p + geom_point()
p
p <- p + stat_boxplot(fill="transparent")
p
## Warning: position_dodge requires constant width: output may be incorrect
## Warning: position_dodge requires non-overlapping x intervals
p <- p + coord_flip()
p
## Warning: position_dodge requires constant width: output may be incorrect
## Warning: position_dodge requires non-overlapping x intervals
p <- p + facet_grid(.~Species)
p
p <- p + optiRum::theme_optimum()
p
ggplot(data=iris, aes(x=Sepal.Width, y=Sepal.Length, colour=Species)) +
  geom_point() +
  stat_boxplot(fill="transparent") +
  # coord_flip() + # Commented out
  facet_grid(.~Species) +
  optiRum::theme_optimum()
Producing documents / documentation directly in R means that you closely interweave (knit) your analysis and R code together. This reduces rework time when you want to change or extend your code, it reduces time to produce new versions, and because it’s code it’s easier to apply strong software development principles to it.
Oh, and you don’t need to spend hours making text boxes in PowerPoint! Win ;-)
There are two languages which you can knit your R code into:
Markdown is great for very quick generation and light (or CSS-driven) styling and is what this section focusses on. LaTeX is excellent for producing stunning, more flexible documents.
The following text is the default text that gets created when you produce a new R Markdown file in RStudio
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
The following text is part of the standard documentation on rmarkdown. I pull it from github.com/rstudio/rmarkdown and integrate it using knitr. It is better than I could produce and the act of integrating it gives an extra example of useful ways to build documents.
This document provides quick references to the most commonly used R Markdown syntax. See the following articles for more in-depth treatment of all the capabilities of R Markdown:
*italic* **bold**
_italic_ __bold__
# Header 1
## Header 2
### Header 3
Unordered List:
* Item 1
* Item 2
+ Item 2a
+ Item 2b
Ordered List:
1. Item 1
2. Item 2
3. Item 3
+ Item 3a
+ Item 3b
R code will be evaluated and printed
```{r}
summary(cars$dist)
summary(cars$speed)
```
There were 50 cars studied
Use a plain http address or add a link to a phrase:
http://example.com
[linked phrase](http://example.com)
Images on the web or local files in the same directory:


A friend once said:
> It's always better to give
> than to receive.
Plain code blocks are displayed in a fixed-width font but not evaluated
```
This text is displayed verbatim / preformatted
```
We defined the `add` function to
compute the sum of two numbers.
LaTeX Equations
Inline equation:
$equation$
Display equation:
$$ equation $$
Three or more asterisks or dashes:
******
------
First Header | Second Header
------------- | -------------
Content Cell | Content Cell
Content Cell | Content Cell
Reference Style Links and Images
A [linked phrase][id].
At the bottom of the document:
[id]: http://example.com/ "Title"
Images
![alt text][id]
At the bottom of the document:
[id]: figures/img.png "Title"
End a line with two or more spaces:
Roses are red,
Violets are blue.
superscript^2^
~~strikethrough~~
library(data.table)
library(shiny)
defaultdisplay<-list(
width="100%", height="75%"
)
shinyAppDir(
system.file("examples/06_tabsets", package="shiny"),
options = defaultdisplay
)
A shiny application report consists of two functions:

* shinyServer()
* shinyUI()

One says what to execute and the other states how to present it. Do all data manipulation and chart production in shinyServer().
defaultdisplay<-list(width="100%", height="75%")
shinyApp(
ui = fluidPage()
, server = function(input, output) {}
, options = defaultdisplay
)
You typically split the app into two files:

* server.R, containing shinyServer()
* ui.R, containing shinyUI()

This can then be run with runApp().

You can instead use a single file, app.R, which contains both functions, but this is typically better suited to very short apps.
Use these just inside shinyUI() to produce a layout
## Page Types
## 1: basicPage
## 2: bootstrapPage
## 3: fixedPage
## 4: fluidPage
## 5: navbarPage
shinyApp(
ui = fluidPage(dateInput("datePicker", "Pick a date:",
format="dd/mm/yy"),
dateRangeInput("dateRange", "Pick dates:",
start=Sys.Date(),
end=Sys.Date() ) ),
server = function(input, output) {}
,options = defaultdisplay
)
Basic
shinyApp(
ui = fluidPage(numericInput("vals", "Insert a number:",
value=15, min=10) ),
server = function(input, output) {}
,options = defaultdisplay
)
Sliders
shinyApp(
ui = fluidPage(sliderInput("vals", "Insert a number:",
min=0, max=50, value=15) ),
server = function(input, output) {}
,options = defaultdisplay
)
A single line
shinyApp(
ui = fluidPage(textInput("char", "Insert text:") ),
server = function(input, output) {}
,options = defaultdisplay
)
A paragraph
shinyApp(
ui = fluidPage(tags$textarea(id="charbox", rows=3,
cols=40, "Default value") ),
server = function(input, output) {}
,options = defaultdisplay
)
shinyApp(
ui = fluidPage(selectInput("multiselect", "Pick favourites:",
c("Green","Red","Blue"),
multiple=TRUE) ),
server = function(input, output) {}
,options = defaultdisplay
)
## Input controls
## 1: checkboxGroupInput
## 2: checkboxInput
## 3: dateInput
## 4: dateRangeInput
## 5: fileInput
## 6: numericInput
## 7: passwordInput
## 8: registerInputHandler
## 9: removeInputHandler
## 10: selectInput
## 11: selectizeInput
## 12: sliderInput
## 13: textInput
## 14: updateCheckboxGroupInput
## 15: updateCheckboxInput
## 16: updateDateInput
## 17: updateDateRangeInput
## 18: updateNumericInput
## 19: updateSelectInput
## 20: updateSelectizeInput
## 21: updateSliderInput
## 22: updateTextInput
shinyApp(
ui = fluidPage(textInput("char", "Insert text:") ,
textOutput("text") ),
server = function(input, output) {
output$text <- renderText(input$char)
} ,options = defaultdisplay
)
shinyApp(
ui = fluidPage(tableOutput("basictable") ),
server = function(input, output) {
output$basictable <- renderTable(head(iris,5))
} ,options = defaultdisplay
)
shinyApp(
ui = fluidPage(dataTableOutput("datatable") ),
server = function(input, output) {
output$datatable <- renderDataTable(head(iris,5))
} ,options = defaultdisplay
)
shinyApp(
ui = fluidPage(plotOutput("chart") ),
server = function(input, output) {
output$chart <- renderPlot(pairs(iris))
} ,options = defaultdisplay
)
a <- reactive({input$a})
a()  # call the reactive like a function to read its current value
shinyApp(
ui = fluidPage(textInput("char", "Insert text:") ,
textOutput("textA"),textOutput("textB") ),
server = function(input, output) {
char<-reactive({rep(input$char,5)})
output$textA <- renderText(paste(char(),collapse="+"))
output$textB <- renderText(paste(char(),collapse="-"))
}
,options = defaultdisplay
)
* shinythemes
* rvest
* shiny::runApp()
* shinyApps package
* Azure portal, using gallery creation for VM
1. `sudo apt-get update` to get the package repository metadata
2. `sudo apt-get install r-base` to get R. Will have lots of extra associated packages - select Y when prompted
3. `sudo su - -c "R -e \"install.packages('shiny', repos='http://cran.rstudio.com/')\""` to install shiny in R
4. `sudo apt-get install gdebi-core` to enable processing of shiny-server installation package
5. `wget http://download3.rstudio.org/ubuntu-12.04/x86_64/shiny-server-1.3.0.403-amd64.deb`
6. `sudo gdebi shiny-server-1.3.0.403-amd64.deb`
7. `sudo nano /etc/shiny-server/shiny-server.conf`
8. `sudo restart shiny-server`